Install & Load Libraries

The first step is to install and load all the libraries needed for our analysis. This is done by running the script below.
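The loading script itself is not reproduced here; a minimal sketch of the usual pattern follows. The package list is inferred from the startup messages below, so treat it as an assumption rather than the exact original list.

```r
# Install any package that is missing, then attach it.
load_or_install <- function(p) {
  if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
  library(p, character.only = TRUE)
}

# Package names inferred from the startup messages that follow.
packages <- c("xts", "quantmod", "tseries", "lubridate", "highcharter",
              "mice", "Amelia", "tidyr", "leaflet", "car", "ggplot2",
              "plotly", "data.table", "readxl", "scales", "jtools")

invisible(lapply(packages, load_or_install))
```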

## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
## 
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2018 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:mice':
## 
##     complete
## 
## Attaching package: 'leaflet'
## The following object is masked from 'package:xts':
## 
##     addLegend
## Loading required package: carData
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday,
##     week, yday, year
## The following objects are masked from 'package:xts':
## 
##     first, last
## [1] "[3.4s]: All necessary packages installed and loaded"

LOAD DATA

To load stock data from Yahoo, we use the get.hist.quote function, since our analysis only needs the dates and the adjusted close prices. This function retrieves exactly that information in one step, without having to download everything and subset afterwards.

nasdaq_symbols <- read_excel("dataset/nasdaq_symbols.xlsx")
all_stocks <- NULL
increment <- 1
for (i in nasdaq_symbols$Symbol) {
  # Download the adjusted close series for one symbol.
  temp_stocks <- get.hist.quote(instrument = i, quote = 'AdjClose',
                                provider = 'yahoo', compression = 'd',
                                quiet = TRUE, retclass = 'zoo')
  # The first symbol starts the dataset; later ones are merged in by date.
  if (increment == 1) {
    all_stocks <- temp_stocks
  } else {
    all_stocks <- merge(all_stocks, temp_stocks)
  }
  increment <- increment + 1
}
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
## 
## This message is shown once per session and may be disabled by setting 
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
## 
## WARNING: There have been significant changes to Yahoo Finance data.
## Please see the Warning section of '?getSymbols.yahoo' for details.
## 
## This message is shown once per session and may be disabled by setting
## options("getSymbols.yahoo.warning"=FALSE).

CHECK FOR NAs …

Checking for NA values is necessary, as our script is intended to serve any given time frame. This means that some stocks will have no closing price for parts of that time frame.

# Check for NAs
summary(all_stocks)
nrow(all_stocks)

… AND DEAL WITH THEM

NA values would distort the return calculations, so we impute each missing price with the nearest non-NA value: last observation carried forward, then carried backward for any leading NAs.

all_stocks_complete <- all_stocks %>%
  na.locf(., na.rm = FALSE) %>%
  na.locf(., fromLast = TRUE)

nrow(all_stocks_complete)
## [1] 7047

CALCULATE THE DAILY LOG RETURN

After imputing the NA values, we can calculate the daily log return. Log returns have several benefits, both theoretical and computational: among other things, they are additive across time and approximately equal to simple returns for small price changes.
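One of those benefits, time-additivity, is easy to check on a toy price series (the prices below are made up purely for illustration):

```r
# Hypothetical prices over three days.
prices <- c(100, 104, 101.92)

# Daily log returns: r_t = log(P_t / P_{t-1}).
daily_log_returns <- diff(log(prices))

# Log returns add across time: the sum of the daily returns
# equals the log return over the whole period.
all.equal(sum(daily_log_returns), log(prices[3] / prices[1]))  # TRUE
```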

We need to address one thing though: the first observation has no previous day to difference against, so the first row of the differenced series is removed.

all_return_daily <- diff(log(all_stocks_complete))

all_return_daily <- all_return_daily[-1,]

ADD DAY, MONTH, YEAR

At this step, we add the Day, Month and Year as separate columns to our datasets, for possible further use.

all_stocks_complete$day <- day(index(all_stocks_complete))
all_stocks_complete$month <- month(index(all_stocks_complete))
all_stocks_complete$year <- year(index(all_stocks_complete))

all_return_daily$day <- day(index(all_return_daily))
all_return_daily$month <- month(index(all_return_daily))
all_return_daily$year <- year(index(all_return_daily))

LAST CHANGES FOR FINAL DATASET

At this stage, we make the last adjustments, such as reshaping the wide price and return objects into long format, with one row per symbol and date.
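The reshaping code itself is not shown above. A sketch of how a wide zoo object could be melted into the long per-day layout shown below, using data.table and a toy two-symbol series in place of the real data, might be:

```r
library(zoo)
library(data.table)
library(lubridate)

# Toy wide object standing in for all_stocks_complete: one column of
# adjusted close prices per symbol, indexed by date.
wide <- zoo(cbind(AAPL = c(150.10, 151.30), XLNX = c(90.54, 89.13)),
            order.by = as.Date(c("2018-12-12", "2018-12-13")))

# Convert to a data.table and melt to long format:
# one row per (date, symbol) pair, as in the final dataset.
dt <- as.data.table(as.data.frame(wide))
dt$date <- index(wide)
long <- melt(dt, id.vars = "date",
             variable.name = "symbol", value.name = "stock_price")

# Re-add the day/month/year columns created earlier.
long[, `:=`(day = day(date), month = month(date), year = year(date))]
```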

AND THIS IS OUR FINAL DATASET (data per day)

tail(all_stocks_complete)
##    day month year symbol stock_price       date
## 1:  12    12 2018   XLNX       90.54 2018-12-12
## 2:  13    12 2018   XLNX       89.13 2018-12-13
## 3:  14    12 2018   XLNX       88.82 2018-12-14
## 4:  17    12 2018   XLNX       87.18 2018-12-17
## 5:  18    12 2018   XLNX       89.23 2018-12-18
## 6:  19    12 2018   XLNX       85.22 2018-12-19
tail(all_return_daily)
##    day month year symbol stock_log_return       date
## 1:  12    12 2018   XLNX      0.026978790 2018-12-12
## 2:  13    12 2018   XLNX     -0.015695809 2018-12-13
## 3:  14    12 2018   XLNX     -0.003484095 2018-12-14
## 4:  17    12 2018   XLNX     -0.018636903 2018-12-17
## 5:  18    12 2018   XLNX      0.023242393 2018-12-18
## 6:  19    12 2018   XLNX     -0.045981180 2018-12-19

GETTING VOLATILITY (SD) TO PLAY

At this step we create different aggregations of the data for the mean and standard deviation. All of them can be seen in the script; below we show the one used for our analysis.

return_mean_daily_by_symbol <- all_return_daily[, list(avg_stock_return = mean(stock_log_return),
                                                       sd_stock_return  = sd(stock_log_return)),
                                                by = list(symbol)]

return_mean_daily_by_symbol <- merge(return_mean_daily_by_symbol,nasdaq_symbols, by.x = 'symbol', by.y = 'Symbol')
##      symbol avg_stock_return sd_stock_return                          Name
##   1:    AAL     7.970379e-05      0.02913133   American Airlines Group Inc
##   2:   AAPL     9.457965e-04      0.02895474                     Apple Inc
##   3:   ADBE     7.721930e-04      0.02998959                    Adobe Inc.
##   4:    ADI     6.603589e-04      0.02875743            Analog Devices Inc
##   5:    ADP     5.670731e-04      0.01549486 Automatic Data Processing Inc
##  ---                                                                      
##  99:   WDAY     1.649721e-04      0.01103229                   Workday Inc
## 100:    WDC     4.057916e-04      0.03892628          Western Digital Corp
## 101:   WYNN     3.558244e-04      0.02311840              Wynn Resorts Ltd
## 102:    XEL     4.189168e-04      0.01531325               Xcel Energy Inc
## 103:   XLNX     6.591518e-04      0.03155594                    Xilinx Inc

FUNCTIONS CREATED FOR RISK ANALYSIS

At this step, two functions are created. The first returns the stock symbol with the largest single-day log return (our proxy for the riskiest stock) in a given period between start and end.

MaximumRisk <- function(d = all_return_daily, start, end)
{
  # Restrict to the requested window, then report the row with
  # the largest single-day log return.
  data.period <- subset(d, d$date >= as.Date(start) & d$date <= as.Date(end))
  print(data.period[which.max(data.period$stock_log_return),
                    c('symbol', 'stock_log_return')])
}
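The function above flags the largest single daily move. If risk is instead measured as the standard deviation of daily returns per symbol, a variant could look like the sketch below (same data.table column names assumed; MaximumRiskSD is a name introduced here, not part of the original script):

```r
library(data.table)

# Symbol with the highest volatility (sd of daily log returns)
# over the period, rather than the largest single move.
MaximumRiskSD <- function(d, start, end) {
  period <- d[date >= as.Date(start) & date <= as.Date(end)]
  by_symbol <- period[, .(risk = sd(stock_log_return)), by = symbol]
  by_symbol[which.max(risk)]
}

# Toy data to illustrate: symbol B swings more than symbol A.
toy <- data.table(
  symbol = rep(c("A", "B"), each = 3),
  date = rep(as.Date(c("2016-01-04", "2016-01-05", "2016-01-06")), 2),
  stock_log_return = c(0.01, 0.01, 0.01, 0.05, -0.05, 0.00)
)
MaximumRiskSD(toy, "2016-01-01", "2016-01-31")  # symbol B
```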

MaximumRisk(all_return_daily, '2016-01-01', '2016-01-31')
##    symbol stock_log_return
## 1:     FB         0.144286

FUNCTIONS CREATED FOR RISK ANALYSIS (II)

The second returns the stock symbol with the most negative single-day log return (the largest single-day drop) in a given period between start and end.

LowestRisk <- function(d = all_return_daily, start, end)
{
  # Restrict to the requested window, then report the row with
  # the most negative single-day log return.
  data.period <- subset(d, d$date >= as.Date(start) & d$date <= as.Date(end))
  print(data.period[which.min(data.period$stock_log_return),
                    c('symbol', 'stock_log_return')])
}

LowestRisk(all_return_daily, '2016-01-01', '2016-01-31')
##    symbol stock_log_return
## 1:   EBAY       -0.1329909

FUNCTIONS CREATED FOR PLOTTING

In this part we create a function that provides a time series plot of the daily stock return. To keep the interactive ggplotly chart fast, we subset to dates from 2000 onwards.

after_2000_return <- all_return_daily[date>'2000-01-01']
plotting_returns <- function(x){
  if (x %in% nasdaq_symbols$Symbol){
    
    ggplotly(ggplot(data=after_2000_return[after_2000_return$symbol == x,], aes(x=date, y=stock_log_return))+
               geom_line(color='cornflowerblue')+
               ggtitle("Daily Stock Returns") +
               xlab("Time Frame") + ylab("Stock Return")+
               theme_minimal()
    )
  }
}

PLOTTING DAILY RETURN

FUNCTIONS CREATED FOR PLOTTING (II)

In this part we create a function that provides a time series plot of the daily stock price. To keep the interactive ggplotly chart fast, we subset to dates from 2000 onwards.

after_2000_stocks <- all_stocks_complete[date>'2000-01-01'] 
plotting_stocks_price <- function(x){
  if (x %in% nasdaq_symbols$Symbol){
    
    ggplotly(ggplot(data=after_2000_stocks[after_2000_stocks$symbol == x,], aes(x=date,y=stock_price))+
               geom_line(color='cornflowerblue')+
               ggtitle("Daily Stock Price") +
               xlab("Time Frame") + ylab("Stock Price")+
               theme_minimal()
    )
  }
}

PLOTTING DAILY STOCK PRICE

LINEAR MODEL

In order to test the relationship between mean return and volatility, we will apply a linear regression model. First, let's look at the relationship between the two through a scatter plot.
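The scatter plot itself is not reproduced in this text; it could be drawn from return_mean_daily_by_symbol roughly as follows (simulated toy data stands in here so the sketch is self-contained):

```r
library(ggplot2)

# Toy stand-in for return_mean_daily_by_symbol: per-symbol mean
# daily log return against its standard deviation (volatility).
set.seed(2018)
toy <- data.frame(sd_stock_return = runif(103, 0.01, 0.05))
toy$avg_stock_return <- 0.00014 + 0.015 * toy$sd_stock_return +
  rnorm(103, sd = 0.00025)

# Mean return against volatility, with a fitted regression line.
p <- ggplot(toy, aes(x = sd_stock_return, y = avg_stock_return)) +
  geom_point(color = 'cornflowerblue') +
  geom_smooth(method = 'lm') +
  xlab("Volatility (sd of daily log returns)") +
  ylab("Mean daily log return") +
  theme_minimal()
p
```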

We then split our data into training and test samples, 80% and 20% respectively.

set.seed(2018)
train.size <- 0.8
train.index <- sample.int(length(return_mean_daily_by_symbol$avg_stock_return), round(length(return_mean_daily_by_symbol$avg_stock_return) * train.size))
train.sample <- return_mean_daily_by_symbol[train.index,]
test.sample <- return_mean_daily_by_symbol[-train.index,]

We fit the model with the mean return as the response and the volatility as the independent variable.

model_0 <- lm(avg_stock_return~sd_stock_return, data = train.sample)

Looking at the regression output below, mean return and volatility have a positive relationship: the coefficient on sd_stock_return is 0.01492, with an intercept of 0.0001384. The slope is highly statistically significant (p ≈ 1.3e-05), while the intercept is not significant at the 5% level (p ≈ 0.10), and the model explains about 21% of the variance in mean returns.

lm_stats <- summary(model_0)
lm_stats
## 
## Call:
## lm(formula = avg_stock_return ~ sd_stock_return, data = train.sample)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -8.195e-04 -1.375e-04 -9.140e-06  1.237e-04  7.260e-04 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.384e-04  8.343e-05   1.659    0.101    
## sd_stock_return 1.492e-02  3.210e-03   4.647 1.31e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0002509 on 80 degrees of freedom
## Multiple R-squared:  0.2126, Adjusted R-squared:  0.2027 
## F-statistic:  21.6 on 1 and 80 DF,  p-value: 1.305e-05

Since point estimates alone do not convey their uncertainty, let's calculate confidence intervals for the coefficients.

confint(model_0, level=.95)
##                         2.5 %       97.5 %
## (Intercept)     -2.763569e-05 0.0003044437
## sd_stock_return  8.530697e-03 0.0213079420
plot_summs(model_0, scale = TRUE, plot.distributions = TRUE, inner_ci_level = 0.95)

CLUSTERING

In order to do our clustering analysis, we first need to make some adjustments: drop the non-numeric symbol and Name columns.

return_mean_daily_0 <- return_mean_daily_by_symbol
return_mean_daily_0$symbol <- NULL
return_mean_daily_0$Name <- NULL

After that we need to rescale our variables in order to have better results.

return_mean_daily_0$avg_stock_return <- rescale(return_mean_daily_0$avg_stock_return)
return_mean_daily_0$sd_stock_return <- rescale(return_mean_daily_0$sd_stock_return)

Then we run the k-means algorithm to obtain our clusters. After several trials, we settle on k = 3 clusters.

results <- kmeans(return_mean_daily_0, 3)
table_cluster <-table(return_mean_daily_by_symbol$symbol, results$cluster)
head(table_cluster,10)
##       
##        1 2 3
##   AAL  0 0 1
##   AAPL 0 0 1
##   ADBE 0 0 1
##   ADI  0 0 1
##   ADP  0 1 0
##   ADSK 0 0 1
##   ALGN 0 0 1
##   ALXN 0 0 1
##   AMAT 0 0 1
##   AMGN 1 0 0

In order to visualize our results we create several plots.

Plot 1

Plot 2

Plot 3
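The three plots are not reproduced in this text. A typical visualization of the clusters on these two variables might look like the sketch below (toy data in place of return_mean_daily_0):

```r
library(ggplot2)

# Toy stand-in for the rescaled clustering data.
set.seed(2018)
toy <- data.frame(avg_stock_return = runif(103),
                  sd_stock_return  = runif(103))
km <- kmeans(toy, centers = 3)

# Points colored by cluster assignment, with the centers overlaid.
p <- ggplot(toy, aes(x = sd_stock_return, y = avg_stock_return,
                     color = factor(km$cluster))) +
  geom_point() +
  geom_point(data = as.data.frame(km$centers),
             aes(x = sd_stock_return, y = avg_stock_return),
             inherit.aes = FALSE, color = 'black', shape = 8, size = 4) +
  labs(color = "Cluster") +
  theme_minimal()
p
```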

Another thing that should be observed is the existence of outliers. Before plotting them, we need to calculate two quantities: each point's assigned cluster center and its distance from that center.

centers1 <- results$centers[results$cluster, ]
distances <- sqrt(rowSums((return_mean_daily_0 - centers1)^2))
outliers <- order(distances, decreasing=T)[1:5]
print(outliers)
## [1] 85 66 81 88 51
print(return_mean_daily_0[outliers,])
##    avg_stock_return sd_stock_return
## 1:     4.247013e-05      0.04849946
## 2:     1.013815e-03      0.04350645
## 3:     4.027235e-04      0.04288946
## 4:     6.297767e-04      0.04180016
## 5:     4.982330e-04      0.04093417

So, let’s detect them through plotting!
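That plot is not reproduced here; the idea can be sketched with the same distance-to-center logic as the outlier calculation above (again with toy data standing in for the real variables):

```r
library(ggplot2)

# Toy stand-in for the clustering data.
set.seed(2018)
toy <- data.frame(avg_stock_return = runif(103),
                  sd_stock_return  = runif(103))
km <- kmeans(toy, centers = 3)

# Distance of each point from its assigned cluster center.
centers1  <- km$centers[km$cluster, ]
distances <- sqrt(rowSums((toy - centers1)^2))
outliers  <- order(distances, decreasing = TRUE)[1:5]

# Plot all points in grey and highlight the five farthest ones.
p <- ggplot(toy, aes(x = sd_stock_return, y = avg_stock_return)) +
  geom_point(color = 'grey60') +
  geom_point(data = toy[outliers, ], color = 'red', size = 3) +
  theme_minimal()
p
```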

CONCLUSION

So, taking all of the above into consideration, we can make the following points:

  1. Through the plotting functions any stock can be explored, in any desired time frame.

  2. Through the risk functions, any stock with high or low risk can be detected, again for any desired time frame.

  3. Lastly, after detecting our desired stock (or stocks) with the previous mechanisms, we can see which cluster it falls into, in order to expand our portfolio to other stocks with similar characteristics.

All in all, the purpose of my assignment was to provide anyone with the most basic and important tools to run their own stock analysis and decide for themselves what is important (or not) and what is risky (or not).